1 EDA using WDI

1.1 Exploratory Data Analysis, EDA

EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

  1. Generate questions about your data

  2. Search for answers by visualising, transforming, and/or modeling your data

  3. Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not. (Posit Primers: EDA)

1.2 Workflow

  1. Importing data by WDI
df_dataframe_name <- WDI(indicator = c(name1 = "Indicator Code 1", 
name2 = "Indicator Code 2"), extra = TRUE)

Write and read:

write_csv(df_dataframe_name, "data/dataframe_name.csv")
df_dataframe_name <- read_csv("data/dataframe_name.csv")
  2. Viewing data by

head(), str(), summary(), and try df_dataframe_name. See also Environment Tab of RStudio.

  3. Transforming data by restricting the values of a variable.
df_dataframe_name |> filter(var == "value") 
df_dataframe_name |> filter(var %in% c("value_1", ... , "value_n"))
df_dataframe_name |> filter(var != "value") 
df_dataframe_name |> distinct(var)
df_dataframe_name |> drop_na(var)
  • Creating a new variable by mutation. (A little advanced. PCAP = gdp/pop)
df_dataframe_name |> mutate(var_new = var1 * var2)
  4. Changing the row order by arrange()
df_dataframe_name |> arrange(var)
df_dataframe_name |> arrange(desc(var))
  5. Visualizing using ggplot() + geom_*()

    What type of variation occurs within my variables?

    What type of covariation occurs between my variables?

  • line graph
transformed_data |> ggplot(aes(year, name1)) + geom_line()
transformed_data |> ggplot(aes(year, name2)) + geom_line()
  • scatterplot
transformed_data |> ggplot(aes(name1, name2)) + geom_point()
transformed_data |> ggplot(aes(name1, name2)) + geom_point() + scale_x_log10()
  • scatterplot with a regression line
transformed_data |> ggplot(aes(name1, name2)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
transformed_data |> ggplot(aes(name1, name2)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + scale_x_log10()
  • histogram
transformed_data |> ggplot(aes(name1)) + geom_histogram()
  • boxplot

categorical_var: factor(year), income, region

transformed_data |> ggplot(aes(categorical_var, name1)) + geom_boxplot()
  6. Do not forget to add your observations and questions.

1.3 Setup

library(tidyverse)
library(WDI)

1.4 Data

List of data and its description

  1. Government expenditure on education, total (% of GDP): SE.XPD.TOTL.GD.ZS [Link]
  2. School enrollment, primary (% gross): SE.PRM.ENRR [Link]
  3. School enrollment, secondary (% gross): SE.SEC.ENRR [Link]
  4. School enrollment, tertiary (% gross): SE.TER.ENRR [Link]
WDIsearch(string = "SE.XPD.TOTL.GD.ZS", field = "indicator", short = FALSE, cache = wdicache)

1.4.1 Importing Data

df_education <- WDI(
  indicator = c(expenditure = "SE.XPD.TOTL.GD.ZS",
                primary = "SE.PRM.ENRR",
                secondary = "SE.SEC.ENRR",
                tertiary = "SE.TER.ENRR"),
  extra = TRUE, cache = wdicache)
write_csv(df_education, "data/education.csv")
df_education <- read_csv("data/education.csv")
Rows: 16758 Columns: 16
── Column specification ───────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): country, iso2c, iso3c, region, capital, income, lending
dbl  (7): year, expenditure, primary, secondary, tertiary, longitude, latitude
lgl  (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.4.2 View Data

head(df_education) or df_education in R Notebook

df_education

Structure of Data: str(df_education) or glimpse(df_education)

str(df_education)
spc_tbl_ [16,758 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ country    : chr [1:16758] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ iso2c      : chr [1:16758] "AF" "AF" "AF" "AF" ...
 $ iso3c      : chr [1:16758] "AFG" "AFG" "AFG" "AFG" ...
 $ year       : num [1:16758] 2012 2008 2009 2004 2011 ...
 $ status     : logi [1:16758] NA NA NA NA NA NA ...
 $ lastupdated: Date[1:16758], format: "2023-12-18" "2023-12-18" ...
 $ expenditure: num [1:16758] 2.6 4.38 4.81 NA 3.46 ...
 $ primary    : num [1:16758] 106.3 103.4 99.4 106.3 100.3 ...
 $ secondary  : num [1:16758] 54.2 39.1 44.4 19.3 52.2 ...
 $ tertiary   : num [1:16758] NA NA 4.02 1.39 3.76 ...
 $ region     : chr [1:16758] "South Asia" "South Asia" "South Asia" "South Asia" ...
 $ capital    : chr [1:16758] "Kabul" "Kabul" "Kabul" "Kabul" ...
 $ longitude  : num [1:16758] 69.2 69.2 69.2 69.2 69.2 ...
 $ latitude   : num [1:16758] 34.5 34.5 34.5 34.5 34.5 ...
 $ income     : chr [1:16758] "Low income" "Low income" "Low income" "Low income" ...
 $ lending    : chr [1:16758] "IDA" "IDA" "IDA" "IDA" ...
 - attr(*, "spec")=
  .. cols(
  ..   country = col_character(),
  ..   iso2c = col_character(),
  ..   iso3c = col_character(),
  ..   year = col_double(),
  ..   status = col_logical(),
  ..   lastupdated = col_date(format = ""),
  ..   expenditure = col_double(),
  ..   primary = col_double(),
  ..   secondary = col_double(),
  ..   tertiary = col_double(),
  ..   region = col_character(),
  ..   capital = col_character(),
  ..   longitude = col_double(),
  ..   latitude = col_double(),
  ..   income = col_character(),
  ..   lending = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

1.4.3 Select Columns

df_ed <- df_education |> select(country, iso2c, year, expenditure, primary, secondary, tertiary, region, income, lending)
df_ed
df_ed |> filter(region == "Aggregates") |> distinct(country)

1.4.3.1 Check Data

df_ed |> group_by(year) |> summarize(expenditure = sum(!is.na(expenditure)), 
                                     primary = sum(!is.na(primary)),
                                     secondary = sum(!is.na(secondary)),
                                     tertiary = sum(!is.na(tertiary))) |>
  arrange(desc(year))

1.4.4 Visualization by Line Graphs

1.4.4.1 Government expenditure on education, total (% of GDP)

df_ed |> filter(country == "Sub-Saharan Africa") |> drop_na(expenditure) |>
  ggplot(aes(year, expenditure)) + geom_line()

df_ed |> filter(region == "Sub-Saharan Africa") |>
  group_by(year) |> summarize(expenditure = sum(!is.na(expenditure)), 
                                     primary = sum(!is.na(primary)),
                                     secondary = sum(!is.na(secondary)),
                                     tertiary = sum(!is.na(tertiary))) |>
  arrange(desc(year))
df_ed |> drop_na(expenditure) |> filter(country %in% c("Arab World", "Africa Eastern and Southern", "Africa Western and Central", "Sub-Saharan Africa", "South Asia")) |> 
  ggplot(aes(year, expenditure, col = country)) + geom_line()

df_ed |> filter(country == "Sub-Saharan Africa") |> drop_na(primary) |>
  ggplot(aes(year, primary)) + geom_line()

df_ed |> filter(country == "Sub-Saharan Africa") |> drop_na(primary) |>
  ggplot() + geom_line(aes(year, primary), col = "red") + geom_line(aes(year, secondary), col = "blue") + geom_line(aes(year, tertiary)) + labs(title = "School Enrollment", y = "% gross")

1.4.5 Long Table

df |> pivot_longer(cols = c(columns to gather), names_to = "name", values_to = "value")

primary:tertiary selects every column from primary through tertiary.

df_ed_long <- df_ed |> pivot_longer(cols = primary:tertiary, names_to = "levels", values_to = "value") 
df_ed_long

1.4.5.1 Examples

Purchasing power parities (PPPs): [R Notebook], [Rmd]

df_gdps_long <- df_gdps |> 
  pivot_longer(cols = c("gdp_nominal", "gdp_real", "gdp_ppp"),
               names_to = "gdp", values_to = "value")
df_ed_long |> drop_na(value) |> 
  filter(country == "Sub-Saharan Africa") |>
  ggplot(aes(year, value, col = levels)) + geom_line()

df_ed_long |> drop_na(value) |> 
  filter(country %in% c("Sub-Saharan Africa", "South Asia")) |>
  ggplot(aes(year, value, col = country, linetype = levels)) + geom_line()

2 Example

2.1 School Enrollment vs GDP Per Capita

We study the relationship between school enrollment at the secondary and tertiary levels and GDP per capita.

2.1.2 Importing Data

df_sec_ter_gdp <- WDI(
  indicator = c(secondary = "SE.SEC.ENRR", tertiary = "SE.TER.ENRR", 
                gdppcap = "NY.GDP.PCAP.PP.KD"), 
  extra = TRUE, cache = wdicache)
write_csv(df_sec_ter_gdp, "data/sec_ter_gdp.csv")
df_sec_ter_gdp <- read_csv("data/sec_ter_gdp.csv")
Rows: 16758 Columns: 15
── Column specification ───────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): country, iso2c, iso3c, region, capital, income, lending
dbl  (6): year, secondary, tertiary, gdppcap, longitude, latitude
lgl  (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.1.3 Viewing the data

df_sec_ter_gdp
df_sec_ter_gdp |> str()
'data.frame':   16758 obs. of  15 variables:
 $ country    : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ iso2c      : chr  "AF" "AF" "AF" "AF" ...
 $ iso3c      : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ year       : int  2012 2008 2009 2004 2011 2013 2014 2010 2003 2007 ...
 $ status     : chr  "" "" "" "" ...
 $ lastupdated: chr  "2023-12-18" "2023-12-18" "2023-12-18" "2023-12-18" ...
 $ secondary  : num  54.2 39.1 44.4 19.3 52.2 ...
  ..- attr(*, "label")= chr "School enrollment, secondary (% gross)"
 $ tertiary   : num  NA NA 4.02 1.39 3.76 ...
  ..- attr(*, "label")= chr "School enrollment, tertiary (% gross)"
 $ gdppcap    : num  2123 1557 1824 1260 1961 ...
  ..- attr(*, "label")= chr "GDP per capita, PPP (constant 2017 international $)"
 $ region     : chr  "South Asia" "South Asia" "South Asia" "South Asia" ...
 $ capital    : chr  "Kabul" "Kabul" "Kabul" "Kabul" ...
 $ longitude  : chr  "69.1761" "69.1761" "69.1761" "69.1761" ...
 $ latitude   : chr  "34.5228" "34.5228" "34.5228" "34.5228" ...
 $ income     : chr  "Low income" "Low income" "Low income" "Low income" ...
 $ lending    : chr  "IDA" "IDA" "IDA" "IDA" ...

2.1.4 Creating a Long Table

df_sec_ter_gdp_long <- df_sec_ter_gdp |> 
  pivot_longer(cols = c(secondary, tertiary)) |>
  select(country, iso2c, year, gdppcap, name, value, region, income)
df_sec_ter_gdp_long

2.1.5 Visualizing by Line Graphs

COUNTRY <- "World"
df_sec_ter_gdp_long |> filter(country == COUNTRY) |> drop_na(value) |>
  ggplot() + geom_line(aes(year, value, col = name)) +
  labs(title = "School enrollment; Secondary and Tertiary")

Observations:

  • Enrollment at both levels has been increasing rapidly since 1990.
INCOME <- c("High income", "Upper middle income","Middle income","Lower middle income","Low & middle income", "Low income")
df_sec_ter_gdp_long |> filter(country %in% INCOME) |> drop_na(value) |>
  ggplot(aes(year, value, col = factor(country, levels = INCOME), linetype = name)) + geom_line() + ylim(c(0,110)) +
  labs(title = "School enrollment: Secondary and Tertiary", 
       col = "Income Levels",
       linetype = "School Levels", y = "")

Observations

  • In this century, we can observe improvements in enrollment. This needs to be studied in more detail.

2.1.6 Scatterplots for Covariation

df_sec_ter_gdp_long |> filter(year == 2020) |> drop_na(value, gdppcap) |>
  ggplot(aes(gdppcap, value, col = name)) + geom_point() + 
  labs(title = "School enrollment: Secondary and Tertiary vs GDP per capita", y = "")

Observations

  • We need to try a log10 conversion of gdppcap, as many points are near the origin.
df_sec_ter_gdp_long |> filter(year == 2020) |> drop_na(value, gdppcap) |>
  ggplot(aes(gdppcap, value, col = name)) + geom_point() + 
  scale_x_log10() +
  labs(title = "School enrollment; Secondary and Tertiary vs GDP per capita in log10 scale", y = "")

df_sec_ter_gdp_long |> filter(year == 2020) |> drop_na(value, gdppcap) |>
  ggplot(aes(gdppcap, value, col = name)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE, formula = 'y~x') +
  scale_x_log10() +
  labs(title = "School enrollment; Secondary and Tertiary vs GDP per capita in log10 scale", y = "")

df_sec_ter_gdp_long |> filter(year == 2020) |> drop_na(gdppcap, value) |>
  filter(name == "tertiary") |> 
  lm(value~log10(gdppcap), data = _) |> summary()

Call:
lm(formula = value ~ log10(gdppcap), data = filter(drop_na(filter(df_sec_ter_gdp_long, 
    year == 2020), gdppcap, value), name == "tertiary"))

Residuals:
    Min      1Q  Median      3Q     Max 
-70.323  -8.289  -0.605   8.209  83.181 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -155.838     12.831  -12.14   <2e-16 ***
log10(gdppcap)   48.735      3.064   15.90   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.36 on 170 degrees of freedom
Multiple R-squared:  0.598, Adjusted R-squared:  0.5957 
F-statistic: 252.9 on 1 and 170 DF,  p-value: < 2.2e-16

Observations

  • At both the secondary and tertiary levels, enrollment correlates with the log10 of GDP per capita.

2.1.7 Box Plots

df_sec_ter_gdp_long |> filter(year == 2020, region != "Aggregates") |> drop_na(value, region) |>
  ggplot(aes(name, value, fill = region)) + geom_boxplot() + 
  labs(title = "Secondary and tertiary school enrollment by region", y = "School enrollment (% gross)", x = "", fill = "") + 
  theme(legend.position = "top")

Observations

  • Regional differences are large at the tertiary level.
df_sec_ter_gdp_long |> filter(year == 2020, region != "Aggregates") |> drop_na(value, income) |>
  ggplot(aes(name, value, fill = factor(income, levels = INCOME))) + geom_boxplot() + 
  labs(title = "Secondary and tertiary school enrollment by income level", y = "School enrollment (% gross)", x = "", fill = "") + 
  theme(legend.position = "top")

Observations

  • Income level appears to have a larger effect than region on enrollment in both secondary and tertiary education.

3 Your Project

3.1 Title of your project

The title can be set in the title field of the YAML header.

3.1.1 Short Abstract

We study …..

3.1.2 About Data

  1. Name of the indicator 1: Indicator Code 1
  • Description:
  1. Name of the indicator 2: Indicator Code 2
  • Description:

3.1.3 Set up

library(tidyverse)
library(WDI)

Create data folder if you do not have it under Files.

dir.create("data")

If you do not have wdicache.rds in your data folder, run the following two code chunks.

wdicache <- WDIcache()

3.1.4 Importing Data

Edit short_name_1, chosen_indicator_1, short_name_2, chosen_indicator_2, etc. You can rename df_yourdata to something more descriptive. If you do, rename it in every other chunk that uses it as well.

df_yourdata <- WDI(
  indicator = c(short_name_1 = chosen_indicator_1,
                short_name_2 = chosen_indicator_2),
  extra = TRUE, cache = wdicache)
write_csv(df_yourdata, "data/yourdata.csv")
df_yourdata <- read_csv("data/yourdata.csv")

3.1.5 Viewing data

3.1.6 Creating a long table

3.1.6.1 Example.

df_sec_ter_gdp_long <- df_sec_ter_gdp |> 
  pivot_longer(cols = c(secondary, tertiary)) 
df_sec_ter_gdp_long

3.1.7 Visualizing data

Observations and Questions:

Observations and Questions:

Observations and Questions:

3.1.8 A choropleth map

If possible, create a choropleth map. (a challenge, not required)

Observations and Questions:

---
title: "Education"
author: "ID Last, First"
date: "`r Sys.Date()`"
output:
  html_notebook:
    number_sections: yes
    toc: yes
    toc_float: yes
---

# EDA using WDI

## Exploratory Data Analysis, EDA

EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

1.  Generate questions about your data

2.  Search for answers by visualising, transforming, and/or modeling your data

3.  Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not. (Posit Primers: [EDA](https://posit.cloud/learn/primers/3.1))

## Workflow

1.  Importing data by WDI

```         
df_dataframe_name <- WDI(indicator = c(name1 = "Indicator Code 1", 
name2 = "Indicator Code 2"), extra = TRUE)
```

Write and read:

```         
write_csv(df_dataframe_name, "data/dataframe_name.csv")
df_dataframe_name <- read_csv("data/dataframe_name.csv")
```

2.  Viewing data by

`head()`, `str()`, `summary()`, and try `df_dataframe_name`. See also Environment Tab of RStudio.

3.  Transforming data by restricting the values of a variable.

```         
df_dataframe_name |> filter(var == "value") 
df_dataframe_name |> filter(var %in% c("value_1", ... , "value_n"))
df_dataframe_name |> filter(var != "value") 
df_dataframe_name |> distinct(var)
df_dataframe_name |> drop_na(var)
```

-   Creating a new variable by mutation. (A little advanced. PCAP = gdp/pop)

```         
df_dataframe_name |> mutate(var_new = var1 * var2)
```

4.  Changing the row order by `arrange()`

```         
df_dataframe_name |> arrange(var)
df_dataframe_name |> arrange(desc(var))
```

5.  Visualizing using ggplot() + geom\_\*()

    What type of **variation** occurs **within** my variables?

    What type of **covariation** occurs **between** my variables?

-   line graph

```         
transformed_data |> ggplot(aes(year, name1)) + geom_line()
transformed_data |> ggplot(aes(year, name2)) + geom_line()
```

-   scatterplot

```         
transformed_data |> ggplot(aes(name1, name2)) + geom_point()
transformed_data |> ggplot(aes(name1, name2)) + geom_point() + scale_x_log10()
```

-   scatterplot with a regression line

```         
transformed_data |> ggplot(aes(name1, name2)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
transformed_data |> ggplot(aes(name1, name2)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + scale_x_log10()
```

-   histogram

```         
transformed_data |> ggplot(aes(name1)) + geom_histogram()
```

-   boxplot

`categorical_var`: `factor(year)`, `income`, `region`

```         
transformed_data |> ggplot(aes(categorical_var, name1)) + geom_boxplot()
```

6.  Do not forget to add your observations and questions.
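The transformation steps in the workflow above (`filter()`, `drop_na()`, `mutate()`, `arrange()`) can be tried without downloading anything. The following is a minimal sketch on a hypothetical toy table; `df_toy`, `df_pcap`, and all the numbers in it are made up for illustration and are not WDI data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for a WDI download (values are invented)
df_toy <- tibble(
  country = c("Japan", "Kenya", "France", "Nauru", "World"),
  region  = c("East Asia & Pacific", "Sub-Saharan Africa",
              "Europe & Central Asia", "East Asia & Pacific", "Aggregates"),
  gdp     = c(5.0e12, 1.0e11, 3.25e12, NA, 9.0e13),
  pop     = c(1.25e8, 5.0e7, 6.5e7, 1.0e4, 7.8e9)
)

df_pcap <- df_toy |>
  filter(region != "Aggregates") |>  # drop the aggregate row
  drop_na(gdp) |>                    # drop rows with a missing gdp
  mutate(pcap = gdp / pop) |>        # new variable: gdp per person
  arrange(desc(pcap))                # largest value first

df_pcap
```

Chaining the steps with `|>` keeps each transformation on its own line, so a step can be commented out to see what it contributes.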

## Setup

```{r}
library(tidyverse)
library(WDI)
```

### WDIcache() [[Link in da4r](https://icu-hsuzuki.github.io/da4r/wdidata.html?q=WDIcache#wdicache)]

If you have not downloaded WDIcache recently, run the following two code chunks; otherwise start with the third.

```{r eval=FALSE}
wdicache <- WDIcache()
```

```{r eval=FALSE}
write_rds(wdicache, "data/wdicache.rds")
```

```{r eval=TRUE}
wdicache <- read_rds("data/wdicache.rds")
```

## Data

*List of data and its description*

1.  Government expenditure on education, total (% of GDP): SE.XPD.TOTL.GD.ZS [[Link](https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS)]
2.  School enrollment, primary (% gross): SE.PRM.ENRR [[Link](https://data.worldbank.org/indicator/SE.PRM.ENRR)]
3.  School enrollment, secondary (% gross): SE.SEC.ENRR [[Link](https://data.worldbank.org/indicator/SE.SEC.ENRR)]
4.  School enrollment, tertiary (% gross): SE.TER.ENRR [[Link](https://data.worldbank.org/indicator/SE.TER.ENRR)]

```{r}
WDIsearch(string = "SE.XPD.TOTL.GD.ZS", field = "indicator", short = FALSE, cache = wdicache)
```

### Importing Data

```{r cache = TRUE, eval = FALSE}
df_education <- WDI(
  indicator = c(expenditure = "SE.XPD.TOTL.GD.ZS",
                primary = "SE.PRM.ENRR",
                secondary = "SE.SEC.ENRR",
                tertiary = "SE.TER.ENRR"),
  extra = TRUE, cache = wdicache)
```

```{r eval = FALSE}
write_csv(df_education, "data/education.csv")
```

```{r}
df_education <- read_csv("data/education.csv")
```

### View Data

`head(df_education)` or `df_education` in R Notebook

```{r}
df_education
```

Structure of Data: `str(df_education)` or `glimpse(df_education)`

```{r}
str(df_education)
```

### Select Columns

```{r}
df_ed <- df_education |> select(country, iso2c, year, expenditure, primary, secondary, tertiary, region, income, lending)
df_ed
```

```{r}
df_ed |> filter(region == "Aggregates") |> distinct(country)
```

#### Check Data

```{r}
df_ed |> group_by(year) |> summarize(expenditure = sum(!is.na(expenditure)), 
                                     primary = sum(!is.na(primary)),
                                     secondary = sum(!is.na(secondary)),
                                     tertiary = sum(!is.na(tertiary))) |>
  arrange(desc(year))
```
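The check above counts, for each year, how many rows have a non-missing value: `!is.na(x)` gives `TRUE` for available values, and `sum()` counts the `TRUE`s. A small sketch on hypothetical data (the table `df_toy` and its values are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data with some missing values
df_toy <- tibble(
  year        = c(2019, 2019, 2020, 2020, 2020),
  expenditure = c(3.1, 2.5, NA, NA, 4.2),
  primary     = c(101, 98, NA, 103, 99)
)

counts <- df_toy |>
  group_by(year) |>
  summarize(expenditure = sum(!is.na(expenditure)),  # non-missing count
            primary     = sum(!is.na(primary))) |>
  arrange(desc(year))                                # most recent year first

counts
```

Years with small counts are poor candidates for cross-country comparison, which is one reason to run this check before choosing a year to plot.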

### Visualization by Line Graphs

#### Government expenditure on education, total (% of GDP)

```{r}
df_ed |> filter(country == "Sub-Saharan Africa") |> drop_na(expenditure) |>
  ggplot(aes(year, expenditure)) + geom_line()
```

```{r}
df_ed |> filter(region == "Sub-Saharan Africa") |>
  group_by(year) |> summarize(expenditure = sum(!is.na(expenditure)), 
                                     primary = sum(!is.na(primary)),
                                     secondary = sum(!is.na(secondary)),
                                     tertiary = sum(!is.na(tertiary))) |>
  arrange(desc(year))
```

```{r}
df_ed |> drop_na(expenditure) |> filter(country %in% c("Arab World", "Africa Eastern and Southern", "Africa Western and Central", "Sub-Saharan Africa", "South Asia")) |> 
  ggplot(aes(year, expenditure, col = country)) + geom_line()
```

```{r}
df_ed |> filter(country == "Sub-Saharan Africa") |> drop_na(primary) |>
  ggplot(aes(year, primary)) + geom_line()
```

```{r}
df_ed |> filter(country == "Sub-Saharan Africa") |> drop_na(primary) |>
  ggplot() + geom_line(aes(year, primary), col = "red") + geom_line(aes(year, secondary), col = "blue") + geom_line(aes(year, tertiary)) + labs(title = "School Enrollment", y = "% gross")
```

### Long Table

```         
df |> pivot_longer(cols = c(columns to gather), names_to = "name", values_to = "value")
```

`primary:tertiary` selects every column from `primary` through `tertiary`.

```{r}
df_ed_long <- df_ed |> pivot_longer(cols = primary:tertiary, names_to = "levels", values_to = "value") 
df_ed_long
```
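To see what `pivot_longer()` does to the shape of a table, here is a minimal sketch on a hypothetical two-row table (`df_wide` and its values are invented for illustration). Each of the three enrollment columns becomes one row per country and year.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per school level
df_wide <- tibble(
  country   = c("A", "B"),
  year      = c(2020, 2020),
  primary   = c(100, 95),
  secondary = c(80, 60),
  tertiary  = c(40, NA)
)

# Gather the three level columns into a name column and a value column
df_long <- df_wide |>
  pivot_longer(cols = primary:tertiary,
               names_to = "levels", values_to = "value")

df_long
```

The long form has 2 x 3 = 6 rows. Having the level in a single column is what allows `col = levels` or `linetype = levels` in `aes()`.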

#### Examples

Purchasing power parities (PPPs): [[R Notebook](https://ds-sl.github.io/intro2r/da4r/gdp_ppp.nb.html)], [[Rmd](https://github.com/ds-sl/intro2r/blob/main/docs/da4r/gdp_ppp.Rmd)]

```         
df_gdps_long <- df_gdps |> 
  pivot_longer(cols = c("gdp_nominal", "gdp_real", "gdp_ppp"),
               names_to = "gdp", values_to = "value")
```

```{r}
df_ed_long |> drop_na(value) |> 
  filter(country == "Sub-Saharan Africa") |>
  ggplot(aes(year, value, col = levels)) + geom_line()
```

```{r}
df_ed_long |> drop_na(value) |> 
  filter(country %in% c("Sub-Saharan Africa", "South Asia")) |>
  ggplot(aes(year, value, col = country, linetype = levels)) + geom_line()
```

## Choropleth Maps [[Link](https://icu-hsuzuki.github.io/da4r/worldmap.html#worldmap)]

A choropleth map needs map data (country polygons) and is drawn with `geom_sf`.

### **Natural Earth Data**

[https://www.naturalearthdata.com](https://www.naturalearthdata.com/)

Get natural earth world country polygons

CRAN: <https://cran.r-project.org/web/packages/rnaturalearth/index.html>

Manual: <https://cran.r-project.org/web/packages/rnaturalearth/rnaturalearth.pdf>

```{r}
library(rnaturalearth)
library(rnaturalearthdata)
```

```         
ne_countries(
  scale = 110,
  type = "countries",
  continent = NULL,
  country = NULL,
  geounit = NULL,
  sovereignty = NULL,
  returnclass = c("sp", "sf")
)
```

#### **Arguments**

-   scale: scale of map to return, one of 110, 50, 10 or ‘small’, ‘medium’, ‘large’

-   type: country type, one of ‘countries’, ‘map_units’, ‘sovereignty’, ‘tiny_countries’

-   continent: a character vector of continent names to get countries from.

-   country: a character vector of country names.

-   geounit: a character vector of geounit names.

-   sovereignty: a character vector of sovereignty names.

-   returnclass: ‘sp’ default or ‘sf’ for Simple Features

```{r}
ne_countries() %>% ggplot() + geom_sf()
```

```{r}
ne_world <- ne_countries(scale = "medium", returnclass = "sf")
```

```{r}
str(ne_world)
```

```{r}
ne_world %>% ggplot() + geom_sf(aes(fill = region_wb))
```

```{r}
df_ed_2020 <- df_ed |> filter(year == 2020) |> select(iso2c, expenditure)
ne_ed_2020 <- ne_world |> left_join(df_ed_2020, by = c("wb_a2" = "iso2c"))
```
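After a `left_join()` like the one above, map rows whose code has no partner in the data keep their shape but get `NA` for the joined variable. The sketch below, on hypothetical tables without geometry (`map_toy` and `data_toy` are invented), shows one way to list the unmatched rows with `anti_join()`.

```r
library(dplyr)

# Hypothetical stand-ins: a map table keyed by wb_a2, a data table keyed by iso2c
map_toy  <- tibble(name  = c("Japan", "Kenya", "France"),
                   wb_a2 = c("JP", "KE", "FR"))
data_toy <- tibble(iso2c       = c("JP", "KE", "XK"),
                   expenditure = c(3.4, 5.1, 4.0))

# Keep every map row; attach expenditure where the codes match
joined <- map_toy |> left_join(data_toy, by = c("wb_a2" = "iso2c"))

# Map rows with no matching data row
unmatched <- map_toy |> anti_join(data_toy, by = c("wb_a2" = "iso2c"))

joined
unmatched
```

Countries listed in `unmatched` would be drawn in the `NA` color on the map, so this is a quick way to tell missing data apart from a code mismatch.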

```{r}
ne_ed_2020 |> ggplot() + geom_sf(aes(fill = expenditure))
```

```{r}
ne_ed_2020 |> filter(subregion == "South-Eastern Asia") |> ggplot() + geom_sf(aes(fill = expenditure))
```

```{r}
df_ed_2020 |> drop_na(expenditure) |> pull() |> range()
```

```{r}
df_ed_2020 |> drop_na(expenditure) |> ggplot(aes(expenditure)) + geom_histogram(binwidth = 3) 
```

It is also possible to choose the breaks from quantiles.

```{r}
df_ed_2020 |> drop_na(expenditure) |> pull(expenditure) |> 
  quantile(probs = c(0.25,0.5,0.75,1))
```

```{r}
ne_ed_2020 |> mutate(level = cut(expenditure, breaks = c(0,3,6,9,20), labels = c("0-3","3-6","6-9","9-20"))) |> ggplot() + geom_sf(aes(fill = level)) + labs(title = "Government expenditure on education, total (% of GDP) in 2020")
```

```{r}
ne_ed_2020 |> filter(continent == "Africa") |> mutate(level = cut(expenditure, breaks = c(0,3,6,9,20), labels = c("0-3","3-6","6-9","9-20"))) |> ggplot() + geom_sf(aes(fill = level)) + labs(title = "Government expenditure on education, \ntotal (% of GDP) in 2020")
```
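The maps above bin a continuous variable with `cut()`. The following sketch, on a hypothetical vector `x`, shows how the default intervals behave: each interval is open on the left and closed on the right, so a value equal to a break falls in the lower bin, and a value equal to the lowest break becomes `NA` unless `include.lowest = TRUE` is set.

```r
# Hypothetical expenditure values (% of GDP), invented for illustration
x <- c(1.5, 3, 4.2, 6, 8.9, 12)

breaks <- c(0, 3, 6, 9, 20)
labels <- c("0-3", "3-6", "6-9", "9-20")  # one label per interval

level <- cut(x, breaks = breaks, labels = labels)
table(level)

# A value exactly at the lowest break is not in any interval by default
cut(0, breaks = breaks, labels = labels)
```

There must be exactly one label fewer than there are breaks; writing the labels so that they repeat the break values makes the legend easier to check.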

# Example

## School Enrollment vs GDP Per Capita

> We study the relationship between school enrollment at the secondary and tertiary levels and GDP per capita.

### Index Search

```{r}
WDIsearch(string = "school enrollment.*(% gross)", field = "name", short = FALSE)
```

**Q to AI:** What does '.\*' mean as a regular expression?

**A of [Poe](https://poe.com/) Assistant:** In regular expressions, the pattern '.\*' is a commonly used expression that matches any sequence of characters (including an empty sequence). Here's what it means:

-   The dot '.' matches any single character except a newline.

-   The asterisk '\*' is a quantifier that matches zero or more occurrences of the preceding element (in this case, the dot).

Therefore, when '.\*' is used as a regular expression, it will match any sequence of characters, regardless of length or content. It's a way to express a wildcard or a catch-all pattern.
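The pattern can be tried directly with base R's `grepl()`. This is a minimal sketch on a hypothetical vector of indicator names; `ignore.case = TRUE` is an assumption of this sketch so that the lowercase pattern matches the capitalized names. Note that in a regular expression the parentheses form a group, so `(% gross)` matches the text `% gross` rather than literal parentheses.

```r
# Hypothetical list of indicator names to search
indicator_names <- c(
  "School enrollment, primary (% gross)",
  "School enrollment, secondary (% gross)",
  "School enrollment, primary (% net)",
  "Government expenditure on education, total (% of GDP)"
)

pattern <- "school enrollment.*(% gross)"

# TRUE where the name contains "school enrollment", then anything, then "% gross"
matches <- grepl(pattern, indicator_names, ignore.case = TRUE)

indicator_names[matches]
```

To require literal parentheses, they would have to be escaped, as in `"school enrollment.*\\(% gross\\)"`.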

1.  School enrollment, secondary (% gross): SE.SEC.ENRR

-   Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown. Secondary education completes the provision of basic education that began at the primary level, and aims at laying the foundations for lifelong learning and human development, by offering more subject- or skill-oriented instruction using more specialized teachers.

2.  School enrollment, tertiary (% gross): SE.TER.ENRR

-   Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown. Tertiary education, whether or not to an advanced research qualification, normally requires, as a minimum condition of admission, the successful completion of education at the secondary level.

3.  GDP per capita, PPP (constant 2017 international \$): NY.GDP.PCAP.PP.KD

-   GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars.
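Because the gross enrollment ratio counts all enrolled students regardless of age, it can exceed 100%. The sketch below illustrates the definition with a hypothetical helper `ger()` and invented numbers; it is not a WDI function.

```r
# Hypothetical helper: gross enrollment ratio in percent
ger <- function(enrolled, official_age_population) {
  100 * enrolled / official_age_population
}

# Invented example: 1063 students enrolled (some older or younger than the
# official age group) for an official-age population of 1000
ger(1063, 1000)
```

A value above 100 therefore does not signal a data error; it usually reflects over-age or under-age students, for example because of late entry or grade repetition.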

### Importing Data

```{r eval=FALSE}
df_sec_ter_gdp <- WDI(
  indicator = c(secondary = "SE.SEC.ENRR", tertiary = "SE.TER.ENRR", 
                gdppcap = "NY.GDP.PCAP.PP.KD"), 
  extra = TRUE, cache = wdicache)
```

```{r eval=FALSE}
write_csv(df_sec_ter_gdp, "data/sec_ter_gdp.csv")
```

```{r}
df_sec_ter_gdp <- read_csv("data/sec_ter_gdp.csv")
```

### Viewing the data

```{r}
df_sec_ter_gdp
```

```{r}
df_sec_ter_gdp |> str()
```

### Creating a Long Table

```{r}
df_sec_ter_gdp_long <- df_sec_ter_gdp |> 
  pivot_longer(cols = c(secondary, tertiary)) |>
  select(country, iso2c, year, gdppcap, name, value, region, income)
df_sec_ter_gdp_long
```

### Visualizing by Line Graphs

```{r}
COUNTRY <- "World"
df_sec_ter_gdp_long |> filter(country == COUNTRY) |> drop_na(value) |>
  ggplot() + geom_line(aes(year, value, col = name)) +
  labs(title = "School enrollment; Secondary and Tertiary")
```

Observations:

-   Enrollment at both levels has been increasing rapidly since 1990.

```{r}
INCOME <- c("High income", "Upper middle income","Middle income","Lower middle income","Low & middle income", "Low income")
df_sec_ter_gdp_long |> filter(country %in% INCOME) |> drop_na(value) |>
  ggplot(aes(year, value, col = factor(country, levels = INCOME), linetype = name)) + geom_line() + ylim(c(0,110)) +
  labs(title = "School enrollment: Secondary and Tertiary", 
       col = "Income Levels",
       linetype = "School Levels", y = "")
```

**Observations**

-   In this century, we can observe improvements in enrollment. This needs to be studied in more detail.

### Scatterplots for Covariation

```{r}
df_sec_ter_gdp_long |> filter(year == 2020) |> drop_na(value, gdppcap) |>
  ggplot(aes(gdppcap, value, col = name)) + geom_point() + 
  labs(title = "School enrollment: Secondary and Tertiary vs GDP per capita", y = "")
```

**Observations**

-   We need to try a log10 conversion of gdppcap, as many points are near the origin.

```{r}
df_sec_ter_gdp_long |> filter(year == 2020) |> drop_na(value, gdppcap) |>
  ggplot(aes(gdppcap, value, col = name)) + geom_point() + 
  scale_x_log10() +
  labs(title = "School enrollment; Secondary and Tertiary vs GDP per capita in log10 scale", y = "")
```

```{r}
df_sec_ter_gdp_long |> filter(year == 2020) |> drop_na(value, gdppcap) |>
  ggplot(aes(gdppcap, value, col = name)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE, formula = 'y~x') +
  scale_x_log10() +
  labs(title = "School enrollment; Secondary and Tertiary vs GDP per capita in log10 scale", y = "")
```

```{r}
df_sec_ter_gdp_long |> filter(year == 2020) |> drop_na(gdppcap, value) |>
  filter(name == "secondary") |> 
  lm(value~log10(gdppcap), data = _) |> summary()
```

```{r}
df_sec_ter_gdp_long |> filter(year == 2020) |> drop_na(gdppcap, value) |>
  filter(name == "tertiary") |> 
  lm(value~log10(gdppcap), data = _) |> summary()
```
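A model of the form `value ~ log10(gdppcap)` is read as follows: the slope is the change in enrollment (in percentage points) for each tenfold increase in GDP per capita. As a sketch, the code below plugs in the coefficients reported in this notebook's rendered output for the tertiary level (intercept -155.838, slope 48.735); the helper `predict_tertiary()` is hypothetical, and the numbers will differ if the data are downloaded again.

```r
# Coefficients copied from the tertiary-level regression output
intercept <- -155.838
slope     <- 48.735

# Hypothetical helper: fitted tertiary enrollment (% gross) for a given GDP per capita
predict_tertiary <- function(gdppcap) {
  intercept + slope * log10(gdppcap)
}

predict_tertiary(10000)                             # fitted value at 10,000 international $
predict_tertiary(100000) - predict_tertiary(10000)  # effect of a tenfold increase
```

Under these coefficients the fitted value at 10,000 international dollars is about 39.1%, and each tenfold increase in GDP per capita adds about 48.7 percentage points. The linear form can give negative fitted values at very low incomes, so it should be read as a description of the trend, not as a prediction rule.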

**Observations**

-   At both the secondary and tertiary levels, enrollment correlates with the log10 of GDP per capita.

### Box Plots 

```{r}
df_sec_ter_gdp_long |> filter(year == 2020, region != "Aggregates") |> drop_na(value, region) |>
  ggplot(aes(name, value, fill = region)) + geom_boxplot() + 
  labs(title = "Secondary and tertiary school enrollment by region", y = "School enrollment (% gross)", x = "", fill = "") + 
  theme(legend.position = "top")
```

**Observations**

-   Regional differences are large at the tertiary level.

```{r}
df_sec_ter_gdp_long |> filter(year == 2020, region != "Aggregates") |> drop_na(value, income) |>
  ggplot(aes(name, value, fill = factor(income, levels = INCOME))) + geom_boxplot() + 
  labs(title = "Secondary and tertiary school enrollment by income level", y = "School enrollment (% gross)", x = "", fill = "") + 
  theme(legend.position = "top")
```

**Observations**

-   Income level appears to have a larger effect than region on enrollment in both secondary and tertiary education.



# Your Project

## Title of your project

*The title can be set in the `title` field of the YAML header.*

### Short Abstract

We study .....

### About Data

1.  Name of the indicator 1: Indicator Code 1

-   Description:

2.  Name of the indicator 2: Indicator Code 2

-   Description:

3.  ...

### Set up

```{r}
library(tidyverse)
library(WDI)
```

*Create data folder if you do not have it under Files.*

```{r eval=FALSE}
dir.create("data")
```

*If you do not have `wdicache.rds` in your data folder, run the following two code chunks.*

```{r wdicache, eval=FALSE}
wdicache <- WDIcache()
```

```{r writewdicache, eval=FALSE, echo=FALSE}
write_rds(wdicache, "data/wdicache.rds")
```

```{r readwdicache, echo=FALSE}
wdicache <- read_rds("data/wdicache.rds")
```

### Importing Data

Edit short_name_1, chosen_indicator_1, short_name_2, chosen_indicator_2, etc. You can rename `df_yourdata` to something more descriptive. If you do, rename it in every other chunk that uses it as well.

```{r}
df_yourdata <- WDI(
  indicator = c(short_name_1 = chosen_indicator_1,
                short_name_2 = chosen_indicator_2),
  extra = TRUE, cache = wdicache)
```

```{r}
write_csv(df_yourdata, "data/yourdata.csv")
```

```{r}
df_yourdata <- read_csv("data/yourdata.csv")
```

### Viewing data

```{r}

```

```{r}

```

### Creating a long table

#### Example.

```
df_sec_ter_gdp_long <- df_sec_ter_gdp |> 
  pivot_longer(cols = c(secondary, tertiary)) 
df_sec_ter_gdp_long
```


```{r}

```

### Visualizing data

```{r}

```

**Observations and Questions:**

- 


```{r}

```

**Observations and Questions:**

-   

```{r}

```

**Observations and Questions:**

-  


### A choropleth map

If possible, create a choropleth map. (a challenge, not required)

```{r}

```

**Observations and Questions:**

-  

- 
